Add guidance for CO HDF/NetCDF #121
Conversation
@abarciauskas-bgse this is a great first version; a few questions and suggested fixes.
TODO: We'll open a separate ticket for a notebook page about writing cloud-optimized files directly from Python, etc., rather than always having to repack existing files.
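For readers following along, here is a minimal sketch of the repacking workflow mentioned here, assuming h5repack's `-S`/`-G` file-space options are available on your system; the file names and the 8 MiB page size are purely illustrative:

```python
import subprocess

# Minimal sketch: repack an existing HDF5/NetCDF-4 file with paged
# aggregation using h5repack (part of the HDF5 command-line tools).
# The input/output names and the 8 MiB page size are illustrative only.
subprocess.run(
    [
        "h5repack",
        "-S", "PAGE",     # file space strategy: paged aggregation
        "-G", "8388608",  # file space page size in bytes (8 MiB)
        "original.nc",
        "cloud_optimized.nc",
    ],
    check=True,
)
```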
A couple of fixes #121 (comment)
This looks good! Just some minor additions that I'm not sure are relevant for a first pass.
If I think of more, I'll add it later (I will be out next week). We also need to finish our tech report on IS2 and CO-HDF5; I think it should be ready for AGU.
@betolink @wildintellect thanks for the feedback. I have some AGU prep to do, but once that is done I will address the comments.
@betolink thank you so much for these detailed comments. I have some comments and questions of my own that I'm hoping will help me sort out the details.
I'm reading a bit more documentation and now I am confused, so I'm hoping to clarify. Using h5repack, it appears there is just one page size setting, while some of the documentation seems to describe separate handling of metadata and raw data.
Secondly, if we include this level of detail, I think I should also clarify that pages are different from chunks. If I understand correctly, HDF5 will create "pages" of data using the page size, but the raw data itself could also be chunked, so presumably the chunks will always be smaller than the page sizes. Is this a correct understanding?
Also, are we concerned about increases in file size purely from a storage-cost perspective? My understanding was that total file size doesn't matter for performance, as long as we can grab reasonably sized chunks from the file.
I see that the HDF5 library has H5Pset_page_buffer_size, and in h5py you can set a corresponding page buffer size when opening a file.
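For context, a minimal sketch of what that could look like when opening a paged file, assuming h5py ≥ 3.x exposes this through the `page_buf_size` keyword; the file name and buffer size below are hypothetical:

```python
import h5py

# Minimal sketch: open a paged HDF5 file with a page buffer enabled.
# Assumes h5py >= 3.x, which exposes H5Pset_page_buffer_size through the
# `page_buf_size` keyword; the file name and size here are hypothetical.
PAGE_BUFFER_SIZE = 8 * 1024 * 1024  # should be >= the file's page size

with h5py.File("example_paged.h5", "r", page_buf_size=PAGE_BUFFER_SIZE) as f:
    # Repeated metadata and raw-data reads that fall within an already
    # buffered page are served from memory rather than re-read from storage.
    subset = f["/some/dataset"][:100]
```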
This is interesting, but I'm still trying to understand it, so I'm not sure how to include it in a way that is useful to readers. For the purposes of this first draft I will omit it, if that's OK with you.
I'll add these in as notes.
Is the tech report out? If so, I will link to it for sure.
Chiming in, since I'm actively working on applying this guidance to some next-generation data products from GMAO. Others, please correct me if I'm wrong! My mental model of paged aggregation is that, when enabled, a page is basically the smallest unit of data that HDF5 can read or write; i.e., you can't read or write part of a page. All the consequences of inappropriately set page sizes flow from that.
I've never seen any HDF5 person mention different page sizes for metadata vs. (chunked) data. I think the two things you're linking to here refer to a different (not page-based) storage management strategy. But it would be awesome if HDF5 could have more flexible page sizes!
My understanding is that chunk sizes should be smaller than page sizes, but I don't think it's required; you can split chunks across multiple pages. Otherwise, HDF5's tiny default page size (4 KB?) would fail for most datasets.
My guess is that there might be a minor performance penalty for retrieving unused data (because you have to download/read more data than you actually need), but it'll be negligible in most cases. So yes, the primary concern with large page sizes is that they inflate overall file size (and therefore storage cost). But, since lots of NASA data are big, that's a very important consideration! A 10% increase in NASA's ~140 PB catalog is ~14 PB, which is multiple big missions' worth of data!
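To make the page/chunk relationship above concrete, here is a minimal sketch of creating a paged file, assuming h5py ≥ 3.x built against HDF5 ≥ 1.10.1; the page size, dataset shape, and chunk shape are illustrative only, not recommendations:

```python
import h5py

# Minimal sketch: create an HDF5 file with paged aggregation enabled.
# Assumes h5py >= 3.x built against HDF5 >= 1.10.1; the page size,
# dataset shape, and chunk shape are illustrative, not recommendations.
PAGE_SIZE = 8 * 1024 * 1024  # 8 MiB pages: the smallest I/O unit for this file

with h5py.File(
    "paged_example.h5",
    "w",
    fs_strategy="page",      # enable paged aggregation
    fs_page_size=PAGE_SIZE,
) as f:
    # Keeping each chunk well under one page lets HDF5 group chunks into
    # pages; partially filled pages are one source of the file-size growth
    # discussed above.
    f.create_dataset(
        "temperature",
        shape=(1000, 1000, 100),
        chunks=(100, 100, 100),  # ~4 MB of float32 per chunk, below 8 MiB
        dtype="f4",
        compression="gzip",
    )
```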
@ashiklom thank you so much for chiming in! These thoughts are super helpful, and I'm interested to know how the GMAO product development goes.
That is a helpful simplification, thank you.
I think you're right that these API methods are for a different file space management strategy.
👍🏽
👍🏽 I have incorporated most of these comments into a new box, "HDF5 File Space Management Strategies".
I concur with everything @ashiklom said. If chunk sizes are larger than page sizes, they will be tracked separately, so page aggregation won't be applied to them, which is bad. I want to dive into a geo-spec for HDF5: how can we rechunk different collections and add the geo-metadata to improve access even more? This is something I talked about with Aleksandar and Patrick. The technical report on ATL03 is almost there; I think I'll use the holidays to finish it. I'm not sure about funding yet, but after talking to Brianna (NASA) I think the Cloud Native summit in April would be a great place to present it.
@betolink thank you for sharing the tech report; it looks great. Just one question:
Do you mean that there will be metadata for both pages AND chunks? And why is this bad, besides an increase in metadata? Is it because of chunk over-reading when reading multiple pages? Sorry, this is the first time I'm hearing about this and I'm curious about how it works. The technical report says "Chunk sizes cannot be larger than the page size," which seems to contradict what we are discussing here (that chunk sizes can be larger than page sizes, but that it slows things down).
@wildintellect ok I have incorporated comments to date. I am happy to merge and publish and we can update with new feedback as it arrives. |
I only have one minor question. Is it bad to drop all references to alternatives to cloud optimization (i.e., services) like Hyrax, OPeNDAP, etc.? Should we be saying why we think cloud-optimized is better, while noting that these alternatives do exist?
Good question. I see cloud-native formats as a better long-term solution than transformation services; although those services are needed in some cases, they too will benefit from the data being in cloud-optimized formats.
They can be larger, but then the driver won't access those chunks using the single-page-size approach; they will be read as if they were in a regular HDF5 file, which is not bad if the chunk sizes are really large. We only ran into one case for ICESat-2: the page size was 8 MB and one dataset had 10 MB chunks. The smaller chunks were grouped into pages; the 10 MB chunks were not. Since they are big enough, performance was not degraded. I think it was one of the two atmospheric datasets.
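A rough sketch of how one might spot that situation (chunks larger than the file's page size), assuming h5py's low-level API exposes the file creation property list call used below; the file name is hypothetical:

```python
import h5py
import numpy as np

# Rough sketch: flag datasets whose (uncompressed) chunks exceed the file's
# page size, like the ~10 MB chunks vs. 8 MB page case described above.
# Assumes h5py exposes get_file_space_page_size() on the file creation
# property list; the file name is hypothetical.
with h5py.File("ATL03_example.h5", "r") as f:
    page_size = f.id.get_create_plist().get_file_space_page_size()

    def check(name, obj):
        if isinstance(obj, h5py.Dataset) and obj.chunks:
            chunk_bytes = np.prod(obj.chunks) * obj.dtype.itemsize
            if chunk_bytes > page_size:
                print(f"{name}: ~{chunk_bytes / 1e6:.1f} MB chunks exceed the "
                      f"{page_size / 1e6:.1f} MB page size (tracked separately)")

    f.visititems(check)
```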
We actually do cover services generally on the home page:
But I think the sentence I just added to the introduction of the CO HDF5/NetCDF-4 page strengthens the intro by providing a reason for cloud-optimizing:
Thanks @betolink for helping out here; I hope you don't mind me pursuing this question about chunk sizes and page sizes. My reasoning may be wrong or I may be missing a scenario, but I'm not sure I understand how having chunks larger than page sizes would degrade read performance. Here are some scenarios:
@wildintellect I'm going to go ahead and merge, and I can incorporate any additional feedback from @betolink and @ajelenak as it comes. Thank you for reviewing it!
Adds long-overdue and much-requested guidance on cloud-optimizing HDF(5) and NetCDF(-4).
I've added @ajelenak and @ashiklom as co-authors and also cited @bilts, @betolink, and @andypbarrett, so I'm tagging all for review.